options(scipen = 999)
library(purrr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
set.seed(123)
Here we will perform a multivariate multiple regression analysis to predict a player’s topline basketball in-game statistics using several variables such their salary, age, team, and certain other in-game statistics. This analysis is motivated by the question: what is the relationship between a player’s salary and their in-game performance?
In this analysis, we investigate the inverse of the typical question: instead of asking how a player’s in-game performance predicts their salary, we ask how a player’s salary predicts their in-game performance. This can help us determine how related the salary is to performance, and may lead to insights in if a player is “overpaid” or “underpaid” relative to their performance.
We must use multivariate multiple regression because we have multiple response variables (e.g. points, assists, and rebounds) and multiple predictor variables (e.g. the salary, age, team, and other in-game statistics).
See the data prep markdown for more details.
We recall from lecture that one of the assumptions of linear regression is that the data is multivariate normal. This distribution is characterized by the following properties:
These are very helpful because they allow us to check for multivariate normality by checking for univariate normality, rather than considering the challenging task of checking for multivariate normality directly.
We now consider the univeriate cases. We plot the distribution of each numeric column and the salary column in order to visually inspect the distribution. Then, we use the Shapiro-Wilk test to test for normality and check if this matches our visual inspection.
# read the data from csv
per_season_data <- readr::read_csv("data/per_season_data.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## player = col_character(),
## Season = col_character(),
## Tm = col_character(),
## Team = col_character(),
## Pos = col_character(),
## Colleges = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
dim(per_season_data)
## [1] 866 31
head(per_season_data)
## # A tibble: 6 x 31
## player Season Age Tm Team Pos Ht Wt Colleges G GS MP
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 abdela… 1990-… 23.9 POR Portl… F-C 82 240 Duke 7928 53 1020
## 2 abdela… 1991-… 23.9 POR Portl… F-C 82 240 Duke 7928 53 1020
## 3 abdela… 1992-… 23.9 POR Bosto… F-C 82 240 Duke 7928 53 1020
## 4 abdela… 1993-… 23.9 POR Bosto… F-C 82 240 Duke 7928 53 1020
## 5 abdela… 1994-… 23.9 POR Sacra… F-C 82 240 Duke 7928 53 1020
## 6 abdela… Career 23.9 POR (may … F-C 82 240 Duke 7928 53 1020
## # … with 19 more variables: FG <dbl>, FGA <dbl>, FG% <dbl>, 3P <dbl>,
## # 3PA <dbl>, 3P% <dbl>, FT <dbl>, FTA <dbl>, FT% <dbl>, ORB <dbl>, DRB <dbl>,
## # TRB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>, PTS <dbl>,
## # Salary <dbl>
# handle NAs for now
per_season_data <- per_season_data %>%
mutate(
Team = ifelse(is.na(Team), "Other", Team),
Season = ifelse(is.na(Season), "Other", Season),
Salary = ifelse(is.na(Salary), mean(Salary), Salary)
)
# create a list of numeric columns
num_cols <- per_season_data %>%
select(where(is.numeric)) %>%
names()
# plot the distribution of each numeric column
for (col in num_cols) {
hist(per_season_data[[col]], main = col)
}
# plot the distribution of the salary
hist(per_season_data$Salary, main = "Salary")
We also present Q-Q plots for the variables, which demonstrats the columns that aren’t normal.
for (col in num_cols) {
qqnorm(per_season_data[[col]], main = col)
qqline(per_season_data[[col]])
# title
title(main = col)
}
We attempted to perform the Shapiro Wilk test, but it was not possible to perform the test on the size of our data - there is a limit of 5000 observations. So, we will perform the test on a sample of the data, but we will not use this test to make any final conclusions.
# shapiro wilk test
for (col in num_cols) {
print(col)
# sample the data of size 5000
x = sample(per_season_data[[col]], 5000)
print(shapiro.test(x))
}
We see clearly that many of the variables are not normally distributed and will need to be transformed. Furthermore, the salary distribution features a long tail, with substantial outliers.
Depending on the skewness of the data, we can consider polynomial transformations and log transformations. We will consider transformations for the following variables which will be used in the regression analysis:
After investigation, we see that the following transformations are suitable:
per_season_data <- per_season_data %>%
mutate(
PTS = PTS^(1/2),
AST = AST^(1/2),
TRB = TRB^(1/2),
Salary = Salary^(1/10),
#Age no transform needed
#Ht no transform needed
Wt = Wt^2,
GS = GS^(1/3),
G = G^(1/2)
#MP no transform needed
)
Now that the data is more suitable for our regression task, we can perform the regression. First, we discuss the characteristics of multivariate multiple regression.
Most notably, multivariate multiple regression is a generalization of multiple regression. In multiple regression, we have a single response variable and multiple predictor variables. In multivariate multiple regression, we have multiple response variables and multiple predictor variables.
To start, we create the following model:
\[Y = \Beta Z\]
# convert some variabels to factor for R's lm function
per_season_data <- per_season_data %>%
mutate(
Season = as.factor(Season),
Team = as.factor(Team),
Pos = as.factor(Pos)
)
mlm1 <- lm(
cbind(PTS, AST, TRB) ~ Salary + Age + Ht + Wt + Pos + GS + G + MP,
data = per_season_data
)
summary(mlm1)
## Response PTS :
##
## Call:
## lm(formula = PTS ~ Salary + Age + Ht + Wt + Pos + GS + G + MP,
## data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14638 -0.18852 -0.00722 0.18709 0.76131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.097799276 0.559859999 0.175 0.861370
## Salary 0.035454466 0.015812087 2.242 0.025211 *
## Age -0.038773455 0.005459598 -7.102 0.000000000002661 ***
## Ht -0.006335313 0.006439966 -0.984 0.325527
## Wt 0.000006346 0.000001748 3.631 0.000300 ***
## PosC-F 0.386460141 0.049725053 7.772 0.000000000000023 ***
## PosF 0.106641824 0.046375383 2.300 0.021724 *
## PosF-C 0.148916003 0.045813807 3.250 0.001199 **
## PosF-G 0.314883217 0.054999630 5.725 0.000000014469906 ***
## PosG 0.229710890 0.073478460 3.126 0.001833 **
## PosG-F 0.115774720 0.061644164 1.878 0.060719 .
## GS -0.059492747 0.016010405 -3.716 0.000216 ***
## G 0.004485254 0.000629516 7.125 0.000000000002273 ***
## MP 0.002301965 0.000075024 30.683 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2873 on 824 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.895, Adjusted R-squared: 0.8933
## F-statistic: 540.2 on 13 and 824 DF, p-value: < 0.00000000000000022
##
##
## Response AST :
##
## Call:
## lm(formula = AST ~ Salary + Age + Ht + Wt + Pos + GS + G + MP,
## data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60364 -0.13881 -0.01604 0.14352 0.64564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.433722509 0.425988596 12.756 < 0.0000000000000002 ***
## Salary -0.001347760 0.012031166 -0.112 0.9108
## Age -0.023261040 0.004154122 -5.600 0.000000029289078 ***
## Ht -0.055841745 0.004900068 -11.396 < 0.0000000000000002 ***
## Wt -0.000008881 0.000001330 -6.678 0.000000000044579 ***
## PosC-F 0.094221978 0.037835005 2.490 0.0130 *
## PosF -0.259636830 0.035286293 -7.358 0.000000000000452 ***
## PosF-C -0.168620414 0.034858999 -4.837 0.000001571373509 ***
## PosF-G -0.057699633 0.041848347 -1.379 0.1683
## PosG -0.121915579 0.055908595 -2.181 0.0295 *
## PosG-F -0.261859860 0.046904067 -5.583 0.000000032120049 ***
## GS 0.024231706 0.012182063 1.989 0.0470 *
## G 0.000912071 0.000478988 1.904 0.0572 .
## MP 0.000787670 0.000057084 13.798 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2186 on 824 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.8361, Adjusted R-squared: 0.8335
## F-statistic: 323.3 on 13 and 824 DF, p-value: < 0.00000000000000022
##
##
## Response TRB :
##
## Call:
## lm(formula = TRB ~ Salary + Age + Ht + Wt + Pos + GS + G + MP,
## data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67856 -0.16327 -0.00599 0.14571 0.56670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7834089684 0.4172853723 -4.274 0.00002145890238160 ***
## Salary 0.0239028111 0.0117853617 2.028 0.0429 *
## Age -0.0287838037 0.0040692502 -7.073 0.00000000000322845 ***
## Ht 0.0422456728 0.0047999566 8.801 < 0.0000000000000002 ***
## Wt -0.0000009692 0.0000013028 -0.744 0.4571
## PosC-F 0.0645121174 0.0370620109 1.741 0.0821 .
## PosF -0.2823501999 0.0345653715 -8.169 0.00000000000000117 ***
## PosF-C -0.0804293081 0.0341468073 -2.355 0.0187 *
## PosF-G -0.3049418836 0.0409933576 -7.439 0.00000000000025535 ***
## PosG -0.5793483637 0.0547663461 -10.579 < 0.0000000000000002 ***
## PosG-F -0.4408658778 0.0459457864 -9.595 < 0.0000000000000002 ***
## GS 0.0254978069 0.0119331759 2.137 0.0329 *
## G 0.0009863719 0.0004692023 2.102 0.0358 *
## MP 0.0007324771 0.0000559180 13.099 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2141 on 824 degrees of freedom
## (28 observations deleted due to missingness)
## Multiple R-squared: 0.8235, Adjusted R-squared: 0.8208
## F-statistic: 295.8 on 13 and 824 DF, p-value: < 0.00000000000000022
mlm1$coefficients
## PTS AST TRB
## (Intercept) 0.097799276044 5.433722508646 -1.783408968368
## Salary 0.035454465501 -0.001347760411 0.023902811122
## Age -0.038773454576 -0.023261039752 -0.028783803703
## Ht -0.006335313218 -0.055841745027 0.042245672768
## Wt 0.000006346443 -0.000008881267 -0.000000969235
## PosC-F 0.386460141094 0.094221977869 0.064512117403
## PosF 0.106641823562 -0.259636830004 -0.282350199945
## PosF-C 0.148916003387 -0.168620413563 -0.080429308144
## PosF-G 0.314883217453 -0.057699632982 -0.304941883580
## PosG 0.229710889778 -0.121915579031 -0.579348363663
## PosG-F 0.115774719513 -0.261859860325 -0.440865877794
## GS -0.059492746679 0.024231706283 0.025497806949
## G 0.004485253829 0.000912070851 0.000986371888
## MP 0.002301964636 0.000787669714 0.000732477057
We will now consider each predictor variable and see how it affects the response variables.
Salary: We see salary has a signicant effect at the 0.05 level only on Points and Rebounds, but does not have a significant effect on Assists. This is suprising, as it suggests that some in-game stats may not be as related to salary as others. It may suggest that the assist-seeking players may not be as highly compensated for that ability. The model confirms that the relatonship between salary and in-game statistical performance is positive.Age: This is a very significant variable and is negative for all three response variables. This suggests that as players become older, their statistical output goes down. This is not surprising.Ht: This variable is not significant for points but it is for assists and rebounds. Height has a positive relationship with rebounds and a negative relationship with assists. This makes sense because certain basketball positions are determined by height and these positions lead to specific in-game roles which may lead to more assists or more rebounds, but all positions can score points. Taller players are more likely to be centers, who are more likely to get rebounds, and shorter players are more likely to be guards, who are more likely to get assists.GS and G: These variables are similar, so we will group them together. They are highly significant for the points response variable, but not so much for the assists and rebounds, which is a little surprising. Furthermore, the GS variable has a negative coefficient with points - highly surprising. This relationship may require further investigation.MP: This variable is significant and positive for all three response variables. This makes sense because players who play more minutes are more likely to score more points, get more assists, and get more rebounds.Before we proceed further, let us consider the fit of the model. We plot the residuals vs fitted values for each of the three response variables and check the normality of the residuals.
plot(mlm1$fit[,1],mlm1$resid[,1])
plot(mlm1$fit[,2],mlm1$resid[,2])
plot(mlm1$fit[,3],mlm1$resid[,3])
qqnorm(mlm1$resid[,1])
qqline(mlm1$resid[,1])
qqnorm(mlm1$resid[,2])
qqline(mlm1$resid[,2])
qqnorm(mlm1$resid[,3])
qqline(mlm1$resid[,3])
We see that this initial model is pretty good, but perhaps we can do better. We can consider alternate subsets of variables to include in the model. And compare them using the AIC metric.
We can use several methods to help us choose the best model. It is preferable to include the least number of predictor variables possible, in order to have an explanable model, avoid overfitting, etc, but to balance this with the need to have a model that is sufficiently accurate.
For example, we may use the AIC, BIC, or adjusted R-squared to help us choose the best model. We calculate the AIC for our initial model.
# define AIC because the function doesnt support multiple respnses
# (resid sum of squares cross product matrix)/n
# AIC = n * log(det(Sigma_d)) - 2p * d
# where p is the number of parameters and d is the number of responses
n = nrow(per_season_data)
# mlm1 number of parameters
p = 8
d = 3
aic = n * log(det(crossprod(mlm1$residuals))) - 2 * p * d
aic
## [1] 9760.221
We first consider the most simple model of only using Salary variable to predict the in-game statistics.
salary_mlm <- lm(
cbind(PTS, AST, TRB) ~ Salary,
data = per_season_data
)
summary(salary_mlm)
## Response PTS :
##
## Call:
## lm(formula = PTS ~ Salary, data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71942 -0.54134 -0.05293 0.56652 2.16099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.55888 0.16274 3.434 0.000623 ***
## Salary 0.53940 0.03701 14.573 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8199 on 851 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.1997, Adjusted R-squared: 0.1988
## F-statistic: 212.4 on 1 and 851 DF, p-value: < 0.00000000000000022
##
##
## Response AST :
##
## Call:
## lm(formula = AST ~ Salary, data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.23091 -0.41696 -0.06474 0.39725 1.48099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.45558 0.10444 4.362 0.0000144555437827 ***
## Salary 0.19357 0.02375 8.149 0.0000000000000013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5262 on 851 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.07239, Adjusted R-squared: 0.0713
## F-statistic: 66.41 on 1 and 851 DF, p-value: 0.000000000000001304
##
##
## Response TRB :
##
## Call:
## lm(formula = TRB ~ Salary, data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17872 -0.29780 -0.02929 0.30152 1.51288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.66224 0.09472 6.992 0.0000000000055 ***
## Salary 0.28135 0.02154 13.060 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4772 on 851 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.167, Adjusted R-squared: 0.166
## F-statistic: 170.6 on 1 and 851 DF, p-value: < 0.00000000000000022
# aic
n = nrow(per_season_data)
p = 1
d = 3
aic_salary = n * log(det(crossprod(salary_mlm$residuals))) - 2 * p * d
aic_salary
## [1] 13572.94
We see that the AIC value is much higher for the salary-only model, so we will not consider this model further. We will consider one more simpler model, which only includes the Salary and demographic variables.
salary_demographic_mlm <- lm(
cbind(PTS, AST, TRB) ~ Salary + Age + Ht + Wt,
data = per_season_data
)
summary(salary_demographic_mlm)
## Response PTS :
##
## Call:
## lm(formula = PTS ~ Salary + Age + Ht + Wt, data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1210 -0.5361 -0.1170 0.5254 2.2587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.484516484 0.953805609 -0.508 0.61160
## Salary 0.533184613 0.035949642 14.831 < 0.0000000000000002 ***
## Age 0.067685732 0.010181618 6.648 0.0000000000532 ***
## Ht -0.002649727 0.012998656 -0.204 0.83852
## Wt -0.000010486 0.000003887 -2.698 0.00712 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7829 on 848 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.2729, Adjusted R-squared: 0.2695
## F-statistic: 79.57 on 4 and 848 DF, p-value: < 0.00000000000000022
##
##
## Response AST :
##
## Call:
## lm(formula = AST ~ Salary + Age + Ht + Wt, data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19532 -0.25254 -0.03732 0.23958 1.39224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.961957922 0.472385780 8.387 < 0.0000000000000002 ***
## Salary 0.226291042 0.017804571 12.710 < 0.0000000000000002 ***
## Age 0.029963959 0.005042591 5.942 0.000000004103941 ***
## Ht -0.047990884 0.006437769 -7.455 0.000000000000223 ***
## Wt -0.000013754 0.000001925 -7.144 0.000000000001953 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3878 on 848 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4981, Adjusted R-squared: 0.4957
## F-statistic: 210.4 on 4 and 848 DF, p-value: < 0.00000000000000022
##
##
## Response TRB :
##
## Call:
## lm(formula = TRB ~ Salary + Age + Ht + Wt, data = per_season_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2966 -0.2354 -0.0045 0.2348 1.0633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.028651737 0.463750672 -13.000 < 0.0000000000000002 ***
## Salary 0.235154187 0.017479107 13.453 < 0.0000000000000002 ***
## Age 0.035885743 0.004950414 7.249 0.000000000000945 ***
## Ht 0.074546974 0.006320088 11.795 < 0.0000000000000002 ***
## Wt 0.000001713 0.000001890 0.906 0.365
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3807 on 848 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.4719, Adjusted R-squared: 0.4694
## F-statistic: 189.4 on 4 and 848 DF, p-value: < 0.00000000000000022
n = nrow(per_season_data)
p = 4
d = 3
aic_demo = n * log(det(crossprod(salary_demographic_mlm$residuals))) - 2 * p * d
aic_demo
## [1] 12044.69
This model also gives a much higher AIC than the initial model, so we will not consider this model further. Indeed, it appears that the initial model is a well performing model and that including all of those variables is beneficial.